The importance of data visualisation

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Data visualisation is a critical, and often overlooked, step in data analysis. It is essential both for exploring data (discovering patterns) and for communicating scientific results.

Consider the following example. Look at this dataset carefully, and see if you can detect any obvious relationship in the data:

x y
1.39 -1.92
0.25 -0.34
0.53 -0.12
0.91 -2.00
1.73 -1.69
0.61 -1.92
0.27 -1.68
0.94 0.00
1.40 -0.08
0.02 -1.20

Raw numbers are hard for the brain to process, and they obscure the analysis. It takes us a long time to work through them, because this is not how we work.

Now, consider the same data, simply plotted:

Easier?

How Humans See Data by John Rauser, 2016.
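As a minimal sketch of the point made above, the ten rows listed in the table can be typed into a data frame and plotted with base R. The object name d is only illustrative, and with so few points any pattern may remain faint. The idea is simply that a scatterplot is read at a glance, whereas the table has to be read row by row.

```r
# The ten (x, y) pairs from the table above
d <- data.frame(
  x = c(1.39, 0.25, 0.53, 0.91, 1.73, 0.61, 0.27, 0.94, 1.40, 0.02),
  y = c(-1.92, -0.34, -0.12, -2.00, -1.69, -1.92, -1.68, 0.00, -0.08, -1.20)
)

# One point per row of the table
plot(d$x, d$y, xlab = "x", ylab = "y")
```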

Graphics systems in R

There are many different graphics systems in R. The three main ones are:

  • base (included in “vanilla” R)
  • lattice
  • ggplot2

In this tutorial, we will study exclusively ggplot2. Why?

  • While base makes simple plots easy, making publication-ready plots with it is really hard,
  • ggplot2 is a framework: once you understand it, it is very flexible,
  • it is widely acknowledged as the highest standard in data visualisation in R,
  • it is in active development, with many extensions available.

With ggplot2, you can do more, faster.

The Grammar of Graphics

ggplot2 is based on a conceptual framework for data visualisation called the Grammar of Graphics (Leland Wilkinson, 1999):

The Grammar of Graphics (Leland Wilkinson, 1999)


For more detail on the grammar of graphics, see “A layered grammar of graphics” by Hadley Wickham.

Data visualisation with ggplot2

The grammar of graphics is a language for describing graphs. Let’s explore the syntax!

First steps

First let’s load the ggplot2 package, and get some test data. We can load the ggplot2 package using the library function:

library(ggplot2)

Then, the test data. For the purposes of this course, we will load some demonstration data embedded in the aqp package. This dataset, named sp6 (for “soil profiles #6”), contains analytical data collected on a range of soil profiles:

data(sp6, package = 'aqp')
head(sp6)
##    id  name top bottom         color texture sand silt clay    Fe   Mn
## 1 A-1    Ap   0     24 7.9YR 2.7/2.0  CN-SiL 35.6 50.9 13.4  49.4 11.0
## 2 A-1    BA  24     45 7.6YR 2.8/2.3       L 35.6 43.0 21.3  52.5  9.2
## 3 A-1   Bt1  45     65 8.0YR 3.7/2.8    CN-L 39.3 34.6 26.1  42.1  3.4
## 4 A-1   Bt2  65    104 6.8YR 2.4/1.8      CL 25.9 39.2 34.9  90.0 28.4
## 5 A-1 Bt/BC 104    185 6.3YR 2.4/1.5      CL 23.4 42.4 34.3  88.9 39.5
## 6 A-1 Bt/BC 185    185 6.1YR 2.1/0.9      CL 33.4 37.4 29.2 101.0 75.0
##      C   pH   Db
## 1 16.2 6.76 1.27
## 2  6.0 6.73 1.28
## 3  1.4 6.76 1.44
## 4  1.4 6.40 1.33
## 5  1.3 4.82 0.75
## 6  1.2 5.40 0.69

To simplify the dataset, we will create a new column named hz (for “horizon”) based on the name column. We reduce the detailed horizon description stored in name to the first capital letter it contains:

library(stringr)

sp6$hz <- str_extract(sp6$name, '[A-Z]')

head(sp6)
##    id  name top bottom         color texture sand silt clay    Fe   Mn
## 1 A-1    Ap   0     24 7.9YR 2.7/2.0  CN-SiL 35.6 50.9 13.4  49.4 11.0
## 2 A-1    BA  24     45 7.6YR 2.8/2.3       L 35.6 43.0 21.3  52.5  9.2
## 3 A-1   Bt1  45     65 8.0YR 3.7/2.8    CN-L 39.3 34.6 26.1  42.1  3.4
## 4 A-1   Bt2  65    104 6.8YR 2.4/1.8      CL 25.9 39.2 34.9  90.0 28.4
## 5 A-1 Bt/BC 104    185 6.3YR 2.4/1.5      CL 23.4 42.4 34.3  88.9 39.5
## 6 A-1 Bt/BC 185    185 6.1YR 2.1/0.9      CL 33.4 37.4 29.2 101.0 75.0
##      C   pH   Db hz
## 1 16.2 6.76 1.27  A
## 2  6.0 6.73 1.28  B
## 3  1.4 6.76 1.44  B
## 4  1.4 6.40 1.33  B
## 5  1.3 4.82 0.75  B
## 6  1.2 5.40 0.69  B
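To see what the regular expression is doing, here is a small sketch on a handful of horizon names. The first four names are taken from the output above; "2C" is a hypothetical name added only to show that a leading digit is skipped, since '[A-Z]' matches a single capital letter and str_extract keeps the first match.

```r
library(stringr)

# Horizon names from the output above, plus a hypothetical "2C"
horizon_names <- c("Ap", "BA", "Bt1", "Bt/BC", "2C")

# str_extract() returns the first match of the pattern in each string
hz_letters <- str_extract(horizon_names, "[A-Z]")
hz_letters
# "A" "B" "B" "B" "C"
```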

Remember how I said that base graphics make it easy to create a simple graph, but hard to make a complex one? Here is a simple base graph (a scatterplot).

plot(sp6$pH, sp6$clay)

It is simple, but it is ugly! The objective of this part of the course is to give you the tools to easily produce this kind of figure:

Back to our first steps in ggplot2. If you remember the Grammar of Graphics, we first attach data to a plot:

ggplot(
  data = sp6
)

This creates a blank plot — data is attached to it, but not represented graphically.

Then, we define the roles that each variable of that dataset will play (aesthetics). This is done through the mapping option, which uses the aes function:

ggplot(
  data = sp6, 
  mapping = aes(x = pH, y = clay)
) 

Still nothing plotted! But you do see there is a coordinate system defined on the plot now. This is based on the aesthetics we provided (x being pH, y being clay).

To create a plot, we need to define a geometry that will represent those variables on the canvas. This is done using the geom_* family of functions. For this simple plot, we will use a point geometry:

ggplot(data = sp6, mapping = aes(x = pH, y = clay)) + 
  geom_point()
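Because data and mapping are the first two arguments of ggplot(), and x and y are the first two arguments of aes(), the same plot is often written more compactly. The sketch below uses a small stand-in data frame (sp6_head, an illustrative name) built from the six rows printed by head(sp6), so that it runs without the aqp package; with the full dataset you would simply use sp6.

```r
library(ggplot2)

# Stand-in data: pH and clay from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH   = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  clay = c(13.4, 21.3, 26.1, 34.9, 34.3, 29.2)
)

# Same kind of plot as above, with data and mapping passed by position
p_short <- ggplot(sp6_head, aes(pH, clay)) +
  geom_point()

# A plot stored in an object is drawn when it is printed
print(p_short)
```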

Aesthetics

We now have our plot! We can define additional aesthetics, such as colour or size:

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(
    aes(colour = hz)
  )

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(
    aes(size = Fe)
  )

The most commonly used aesthetics are:

  • x
  • y
  • alpha (opacity)
  • colour
  • fill
  • group
  • shape
  • size
  • stroke (border width of points)

Different aesthetics can also be combined:

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(
    aes(colour = hz, size = Fe)
  )

Also, you can use expressions when defining aesthetics:

ggplot(data = sp6, aes(x = pH, y = log(C))) +
  geom_point()
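An alternative to transforming a variable inside aes() is to transform the scale instead. The sketch below uses scale_y_log10(), which keeps the axis labels in the original units. Note that it uses a base-10 logarithm, whereas log(C) above is a natural logarithm, so the axis values differ even though the pattern of points is the same. The data frame sp6_head is an illustrative stand-in built from the six rows printed by head(sp6).

```r
library(ggplot2)

# Stand-in data: pH and C from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  C  = c(16.2, 6.0, 1.4, 1.4, 1.3, 1.2)
)

# Map the raw variable, then put the y axis on a log10 scale
p_log <- ggplot(sp6_head, aes(x = pH, y = C)) +
  geom_point() +
  scale_y_log10()

print(p_log)
```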

Geometries

A geometry (geom) is the geometrical object that a plot uses to represent data. Geometries are often used to describe plots: bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, etc.

Different geometries will convey the dataset in different ways:

On the left hand side, data is represented by points (geom_point), while on the right hand side, data is represented by a smoothed curve (geom_smooth).

# left
pl <- ggplot(data = sp6) + 
  geom_point(
    mapping = aes(x = pH, y = log(C))
  )

# right
pr <- ggplot(data = sp6) + 
  geom_smooth(
    mapping = aes(x = pH, y = log(C))
  )

ggplot2 provides geometries for most, if not all, types of plots.

From R for Data Science, Wickham and Grolemund, 2017.

All of these geoms are constructed using a set of fundamental geometries:

From R for Data Science, Wickham and Grolemund, 2017.


Here are a few examples of these geometries in action:

# Histogram
ggplot(data = sp6, aes(x = pH)) +
  geom_histogram()

# Probability density function
ggplot(data = sp6, aes(x = pH)) +
  geom_density()

# Boxplot
ggplot(data = sp6) +
  geom_boxplot(aes(x = hz, y = pH))
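By default, geom_histogram() uses 30 bins and prints a message suggesting that you pick a bin width yourself. A minimal sketch of doing so is shown below; the width of 0.5 pH units is an arbitrary choice for illustration, and sp6_head is a stand-in data frame built from the six rows printed by head(sp6).

```r
library(ggplot2)

# Stand-in data: pH from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40)
)

# Histogram with an explicit bin width (in pH units)
p_hist <- ggplot(sp6_head, aes(x = pH)) +
  geom_histogram(binwidth = 0.5)

print(p_hist)
```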

Different geometries can be combined in the same plot:

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point() +
  geom_smooth()

Remember that geometries can take different aesthetics too!

ggplot(data = sp6, aes(x = pH)) +
  geom_density(
    mapping = aes(fill = hz), 
    alpha = 0.3
  )
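In the example above, fill is placed inside aes() while alpha is placed outside it. The difference matters: inside aes(), an aesthetic is mapped to a variable and varies with the data; outside aes(), it is set to a constant for the whole layer. The sketch below contrasts the two for colour, using an illustrative stand-in data frame (sp6_head) built from the six rows printed by head(sp6).

```r
library(ggplot2)

# Stand-in data: pH, clay and hz from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH   = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  clay = c(13.4, 21.3, 26.1, 34.9, 34.3, 29.2),
  hz   = c("A", "B", "B", "B", "B", "B")
)

# Mapped: colour depends on the hz column, and a legend is drawn
p_mapped <- ggplot(sp6_head, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz))

# Set: every point gets the same fixed colour, and no legend is drawn
p_set <- ggplot(sp6_head, aes(x = pH, y = clay)) +
  geom_point(colour = "red")

print(p_mapped)
print(p_set)
```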

Facetting

Facetting is another layer in the Grammar of Graphics: it relates to data groupings. This data visualisation technique is very useful for splitting a graph into sub-graphs according to groups in the data.

Two different facetting tools are available in ggplot2:

  • facet_wrap creates a 1-D ribbon of panels
  • facet_grid creates a 2-D matrix of panels

The arguments of these functions are formulas that look like:

  • ~variable for facet_wrap
  • var1 ~ var2 for facet_grid (var1 defines the rows, var2 the columns)

ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, shape = hz)) +
  facet_wrap(~hz, ncol = 1)

ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, shape = hz, colour = id)) +
  facet_grid(hz ~ id)
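In a facet_grid formula, a dot can stand in for "no variable on this side", which gives control over the direction of the panels. The sketch below shows both directions on an illustrative stand-in data frame (sp6_head) built from the six rows printed by head(sp6), where only horizons A and B are present.

```r
library(ggplot2)

# Stand-in data: pH, clay and hz from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH   = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  clay = c(13.4, 21.3, 26.1, 34.9, 34.3, 29.2),
  hz   = c("A", "B", "B", "B", "B", "B")
)

p_base <- ggplot(sp6_head, aes(x = pH, y = clay)) +
  geom_point()

# hz on the left of ~ : one row of panels per horizon
p_rows <- p_base + facet_grid(hz ~ .)

# hz on the right of ~ : one column of panels per horizon
p_cols <- p_base + facet_grid(. ~ hz)

print(p_rows)
print(p_cols)
```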

Themes and the production of publication-ready figures

Themes are the last layer in the Grammar of Graphics: they relate to the visual design aspects of the graph. Interestingly, this is probably the part that is the most foreign to scientists!

You can apply a theme by adding a function from the theme_* family. The default theme is theme_gray(). We can switch to a black-and-white theme using theme_bw():

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz))

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) + 
  theme_bw()

The labels of the graph (axis titles, title, subtitle, caption) complete the design. They are set with the labs function:

ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) +
  labs(
    x = "pH",
    y = "Clay (%)",
    title = "pH vs. Clay for 3 different horizons",
    subtitle = "This is a subtitle with a clever explanation",
    caption = "Data from: Bourgault and Rabenhorst, 2011."
  )
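Once a figure looks right, it can be written to a file at a fixed size with ggsave(). The sketch below is only an illustration: the file name, the dimensions and the stand-in data frame (sp6_head, built from the six rows printed by head(sp6)) are assumptions, and a temporary directory is used so that nothing is written to your project.

```r
library(ggplot2)

# Stand-in data: pH, clay and hz from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH   = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  clay = c(13.4, 21.3, 26.1, 34.9, 34.3, 29.2),
  hz   = c("A", "B", "B", "B", "B", "B")
)

p_pub <- ggplot(sp6_head, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) +
  labs(x = "pH", y = "Clay (%)", colour = "Horizon") +
  theme_bw()

# Hypothetical output path; replace with e.g. "figure1.pdf"
out_file <- file.path(tempdir(), "ph_vs_clay.pdf")

# The file type is guessed from the extension; size is given in centimetres
ggsave(out_file, plot = p_pub, width = 12, height = 8, units = "cm")
```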

Modularity

Note that all ggplot2 layers can be stored in objects and added (using +) in a very modular way:

p <- ggplot(data = sp6)

p + geom_point(aes(x = pH, y = clay))

p + geom_point(aes(x = pH, y = clay)) + theme_bw()

p1 <- p + geom_point(aes(x = pH, y = clay))
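The same idea extends to groups of layers: several components can be stored together in a list and added in one go. This is a minimal sketch; the names my_style and sp6_head are illustrative, and the data frame is a stand-in built from the six rows printed by head(sp6).

```r
library(ggplot2)

# Stand-in data: pH and clay from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH   = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  clay = c(13.4, 21.3, 26.1, 34.9, 34.3, 29.2)
)

# A reusable bundle: one geometry and one theme
my_style <- list(
  geom_point(aes(x = pH, y = clay)),
  theme_bw()
)

# Adding the list adds each of its elements to the plot
p_styled <- ggplot(data = sp6_head) + my_style

print(p_styled)
```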

Extensions: the ggplot2 ecosystem

The ggplot2 community is very active, and people contribute R packages that extend the core capabilities of ggplot2. Here are just a few examples.

Additional themes

Some packages provide additional themes. These are great, as they let scientists focus on their science without having to dig too deep into technical (yet critical!) details such as colour choices or typography:

library(ggthemes)
library(hrbrthemes)

p <- ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, colour = hz)) +
  labs(
    x = "pH",
    y = "Clay (%)",
    title = "pH vs. Clay for 3 different horizons",
    subtitle = "This is a subtitle with a clever explanation",
    caption = "Data from: Bourgault and Rabenhorst, 2011."
  )
  
# Theme from "The Economist"
p + 
  scale_colour_economist(name = "Horizon") + 
  theme_economist()

# Another theme with great typography
p + 
  scale_colour_ipsum(name = "Horizon") + 
  theme_ipsum()

Animations

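The gganimate package provides a new aesthetic called frame, and generates an animated GIF. Note that the frame aesthetic and the gganimate() function shown here belong to early releases of the package; from version 1.0 onwards they were replaced by the transition_* functions and animate().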

First, you create a ggplot:

library(gganimate)

p <- ggplot(sp6, aes(x = pH, frame = hz)) +
  geom_density() 

The gganimate command launches the actual animation:

gganimate(p)

Animated plot
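With gganimate 1.0 or later, the same idea could be sketched with transition_states() and animate(). This is an assumption about which version you have installed, not the code used to produce the animation above. A point plot is used because the stand-in data frame (sp6_head, built from the six rows printed by head(sp6)) has too few rows per horizon for a density estimate.

```r
library(ggplot2)
library(gganimate)

# Stand-in data: pH, clay and hz from the six rows printed by head(sp6)
sp6_head <- data.frame(
  pH   = c(6.76, 6.73, 6.76, 6.40, 4.82, 5.40),
  clay = c(13.4, 21.3, 26.1, 34.9, 34.3, 29.2),
  hz   = c("A", "B", "B", "B", "B", "B")
)

# transition_states() animates between the values of a column
anim <- ggplot(sp6_head, aes(x = pH, y = clay)) +
  geom_point() +
  transition_states(hz)

# Rendering needs a renderer (for example the gifski package) and can be
# slow, so it is only triggered in an interactive session here
if (interactive()) {
  animate(anim)
}
```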

Interactive plots

The plotly package is another data visualisation package, which generates interactive graphics in JavaScript (using web technologies).

The ggplotly command can easily convert your ggplot into an interactive graphic:

library(plotly)

p2 <- ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, colour = hz))
ggplotly(p2)
ggplotly(p)

For more information on ggplot2